── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(maps)
Attaching package: 'maps'
The following object is masked from 'package:purrr':
map
# A tibble: 105 × 2
# Groups: dest [105]
dest n
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
# ℹ 95 more rows
Ans: Among all 2013 flights departing from New York City’s three main airports, the ten busiest destinations are shown above. Chicago O’Hare (ORD) is the busiest with 17,283 flights, followed by Atlanta (ATL) with 17,215 and Los Angeles (LAX) with 16,174. The rest of the top ten, in order, are BOS, MCO, CLT, SFO, FLL, MIA, and DCA.
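A minimal sketch of how a destination ranking like the one above could be produced. The tibble toy_flights is a hypothetical stand-in for the flights data (its rows are made up for illustration), and the use of count() is an assumption about the approach, not necessarily the code behind the table.

```r
library(dplyr)

# Hypothetical stand-in for the flights table (illustrative rows only)
toy_flights <- tibble(
  dest = c("ORD", "ATL", "ORD", "LAX", "ORD", "ATL")
)

# Count flights per destination and sort from busiest to least busy
top_dest <- toy_flights %>%
  count(dest, sort = TRUE)

# Show at most the ten busiest destinations
print(head(top_dest, 10))
```

With the real data, replacing toy_flights with flights should give a table of the same shape (dest, n).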
# A tibble: 17 × 3
tailnum n airlines
<chr> <int> <chr>
1 N146PQ 2 9E, EV
2 N153PQ 2 9E, EV
3 N176PQ 2 9E, EV
4 N181PQ 2 9E, EV
5 N197PQ 2 9E, EV
6 N200PQ 2 9E, EV
7 N228PQ 2 9E, EV
8 N232PQ 2 9E, EV
9 N933AT 2 DL, FL
10 N935AT 2 DL, FL
11 N977AT 2 DL, FL
12 N978AT 2 DL, FL
13 N979AT 2 DL, FL
14 N981AT 2 DL, FL
15 N989AT 2 DL, FL
16 N990AT 2 DL, FL
17 N994AT 2 DL, FL
Ans: Seventeen planes flew for more than one airline. Their tail numbers and the corresponding airlines are listed above. Each of these planes flew for exactly two carriers: either 9E and EV, or DL and FL.
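One way to find planes shared between carriers is sketched below. The data in toy_flights are hypothetical, and dropping rows with a missing tail number is an assumption on my part; the column names n and airlines simply mirror the table above.

```r
library(dplyr)

# Hypothetical stand-in for the flights table (illustrative rows only)
toy_flights <- tibble(
  tailnum = c("N1", "N1", "N2", "N2", "N3", NA),
  carrier = c("9E", "EV", "DL", "DL", "FL", "AA")
)

multi_carrier <- toy_flights %>%
  filter(!is.na(tailnum)) %>%            # assumption: ignore flights with no tail number
  group_by(tailnum) %>%
  summarise(
    n = n_distinct(carrier),             # number of different carriers per plane
    airlines = paste(sort(unique(carrier)), collapse = ", ")
  ) %>%
  filter(n > 1)                          # keep planes used by more than one carrier

print(multi_carrier)
```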
Question 4
unique(weather$origin)
[1] "EWR" "JFK" "LGA"
length(unique(airports$faa))
[1] 1458
Ans: The weather data record hourly weather conditions at each origin airport, and the airports data give each airport’s location in the United States. The two tables are linked through the airport code (faa in airports, origin in weather). The relationship is one-to-many: each of the three origin airports (EWR, JFK, and LGA) is linked to many hourly weather observations in 2013, while the remaining 1,455 airport codes have no weather records.
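The one-to-many claim can be checked directly. The sketch below uses two hypothetical toy tables in place of airports and weather; the three checks (unique key, several rows per origin, no unmatched origin) are my own suggestion rather than part of the original answer.

```r
library(dplyr)

# Hypothetical stand-ins for the airports and weather tables
toy_airports <- tibble(faa = c("EWR", "JFK", "LGA", "ORD"))
toy_weather <- tibble(
  origin = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA"),
  hour   = c(0L, 1L, 0L, 1L, 0L, 1L)
)

# "One" side: every faa code appears exactly once in airports
faa_is_key <- !any(duplicated(toy_airports$faa))

# "Many" side: each origin has several weather rows
obs_per_origin <- toy_weather %>% count(origin)

# Every weather origin should have a matching airport record
unmatched <- toy_weather %>%
  anti_join(toy_airports, by = c("origin" = "faa"))

print(faa_is_key)
print(obs_per_origin)
print(nrow(unmatched))
```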
Ans: The six duplicate records all occurred on the same date, November 3, 2013, at around 1:00 AM. This is the night Daylight Saving Time (DST) ended, when clocks were set back from 2:00 AM to 1:00 AM. As a result, the 1:00 AM hour occurred twice at each of the three airports, creating duplicate entries in the weather data.
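A sketch of how duplicated hours could be located, assuming the duplicates were identified on the clock-time key (origin, month, day, hour). The toy_weather rows are hypothetical and only imitate the repeated 1:00 AM hour.

```r
library(dplyr)

# Hypothetical stand-in for the weather table around the end of DST
toy_weather <- tibble(
  origin = c("EWR", "EWR", "EWR", "JFK", "JFK"),
  month  = c(11L, 11L, 11L, 11L, 11L),
  day    = c(3L, 3L, 3L, 3L, 3L),
  hour   = c(0L, 1L, 1L, 1L, 2L)
)

# Count rows per candidate key and keep keys that occur more than once
dupes <- toy_weather %>%
  count(origin, month, day, hour) %>%
  filter(n > 1)

print(dupes)
```

A key that returns zero rows here identifies each observation uniquely; any rows returned point to the repeated hours.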
Question 6
flights <- flights %>%
  left_join(weather, by = c("origin", "time_hour"))
Step 2: Check the size of the data
dim(flights)
[1] 336776 35
After merging with the weather data, the flights data contain 336,776 observations and 35 variables.
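Beyond checking the dimensions, it can help to confirm that the left join kept exactly one row per flight and to see how many flights found no weather record. The sketch below uses hypothetical toy tables; same_rows and n_unmatched are illustrative names.

```r
library(dplyr)

# Hypothetical stand-ins for flights and weather, keyed by origin and time_hour
toy_flights <- tibble(
  origin = c("EWR", "JFK", "LGA"),
  time_hour = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-01 05:00:00", "2013-01-01 06:00:00"),
    tz = "UTC"
  )
)
toy_weather <- tibble(
  origin = c("EWR", "JFK"),
  time_hour = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-01 05:00:00"),
    tz = "UTC"
  ),
  temp = c(39.0, 37.9)
)

joined <- toy_flights %>%
  left_join(toy_weather, by = c("origin", "time_hour"))

# A left join on a unique weather key should not change the number of flights
same_rows <- nrow(joined) == nrow(toy_flights)

# Flights with no temperature after the join (unmatched, or temp missing in weather)
n_unmatched <- sum(is.na(joined$temp))

print(same_rows)
print(n_unmatched)
```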
# A tibble: 6 × 35
year.x month.x day.x dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1842 NA NA 2019
2 2013 9 30 NA 1455 NA NA 1634
3 2013 9 30 NA 2200 NA NA 2312
4 2013 9 30 NA 1210 NA NA 1330
5 2013 9 30 NA 1159 NA NA 1344
6 2013 9 30 NA 840 NA NA 1020
# ℹ 27 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour.x <dbl>, minute <dbl>, time_hour <dttm>, dep_cat <chr>, arr_cat <chr>,
# red_flight <lgl>, year.y <int>, month.y <int>, day.y <int>, hour.y <int>,
# temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, wind_speed <dbl>,
# wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>
Step 5: Visualize the distributions of key variables
summary(flights$temp)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10.94 42.08 57.20 57.00 71.96 100.04 1573
summary(flights$dewp)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-9.94 26.06 42.80 41.63 57.92 78.08 1573
summary(flights$humid)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
12.74 43.99 57.73 59.56 75.33 100.00 1573
summary(flights$wind_dir)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 130.0 220.0 201.5 290.0 360.0 9796
summary(flights$wind_speed)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 6.905 10.357 11.114 14.960 42.579 1634
summary(flights$wind_gust)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
16.11 20.71 24.17 25.25 28.77 66.75 256391
summary(flights$precip)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00000 0.00000 0.00000 0.00456 0.00000 1.21000 1556
summary(flights$pressure)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
983.8 1012.7 1017.5 1017.8 1022.8 1042.1 38788
summary(flights$visib)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 10.000 10.000 9.256 10.000 10.000 1556
I define key variables as those that help describe the weather conditions. I therefore visualize nine variables: temperature (temp), dew point (dewp), humidity (humid), wind direction (wind_dir), wind speed (wind_speed), wind gust (wind_gust), precipitation (precip), pressure, and visibility (visib).
The weather variables contain many missing values, which makes it difficult to fully determine the weather conditions. Wind gust (256,391 missing values) and pressure (38,788) are affected the most. We should therefore handle missing data carefully during the analysis to avoid bias or incorrect conclusions.
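The missing-value counts reported by summary() can also be collected into a single table. This is a minimal sketch on a hypothetical toy tibble; with the merged data, toy would be replaced by flights and vars by the full list of weather variables.

```r
library(dplyr)

# Hypothetical stand-in for a few weather columns of the merged data
toy <- tibble(
  temp      = c(39.0, NA, 41.0, 42.1),
  wind_gust = c(NA, NA, 20.7, NA),
  pressure  = c(1012.0, 1012.3, NA, NA)
)
vars <- c("temp", "wind_gust", "pressure")

# Number of missing values per variable
na_counts <- toy %>%
  summarise(across(all_of(vars), ~ sum(is.na(.x))))

# Share of missing values per variable
na_share <- toy %>%
  summarise(across(all_of(vars), ~ mean(is.na(.x))))

print(na_counts)
print(na_share)
```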
vars <- c("temp", "dewp", "humid", "wind_dir", "wind_speed", "wind_gust", "pressure", "precip")
for (v in vars) {
  p <- flights %>%
    select(all_of(v)) %>%
    filter(!is.na(.data[[v]])) %>%
    ggplot(aes(x = .data[[v]])) +
    geom_histogram(color = "white", fill = "#990000") +
    labs(title = paste("Distribution of", v), x = v, y = "Count") +
    theme_minimal(base_size = 13)
  print(p)
}
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
flights %>%
  filter(!is.na(visib)) %>%
  ggplot(aes(x = visib)) +
  geom_histogram(binwidth = 1, boundary = 0, color = "white", fill = "#990000") +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  labs(title = "Distribution of visibility", x = "Visibility", y = "Count") +
  theme_minimal(base_size = 13)
flights %>%
  filter(!is.na(precip)) %>%
  ggplot(aes(y = precip)) +
  geom_boxplot(fill = "#990000", alpha = 0.7) +
  labs(title = "Boxplot of precipitation", y = "Precipitation", x = NULL) +
  theme_minimal(base_size = 13)
I also added a box plot of precip to check its distribution, since most of its values are zero (the median and third quartile are both 0).
Measuring the length of delay with early and on-time departures treated as 0 minutes, the worst day was March 8, 2013, with an average departure delay of 84.1 minutes.
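The daily average described above could be computed roughly as follows. The toy_flights rows are hypothetical, and dropping flights with a missing dep_delay (for example, cancelled flights) is an assumption; the original answer does not say how they were handled.

```r
library(dplyr)

# Hypothetical stand-in for the flights table (illustrative rows only)
toy_flights <- tibble(
  month     = c(3L, 3L, 3L, 3L),
  day       = c(8L, 8L, 9L, 9L),
  dep_delay = c(120, -5, 10, NA)
)

daily <- toy_flights %>%
  filter(!is.na(dep_delay)) %>%                  # assumption: drop flights with no recorded delay
  mutate(delay_len = pmax(dep_delay, 0)) %>%     # early/on-time departures count as 0
  group_by(month, day) %>%
  summarise(avg_delay = mean(delay_len), .groups = "drop") %>%
  arrange(desc(avg_delay))                       # worst day first

print(daily)
```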
Method: Weather is recorded at the origin–hour level, so many flights share the same weather record (clustering). To keep plots interpretable, I first averaged departure delays by origin × hour, then, for each weather variable, averaged again by that variable’s value; each point therefore represents the mean delay (min) at a given weather level.
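The two-stage averaging in the method can be sketched for a single weather variable (humidity) as below. The data are hypothetical, and computing the correlation on the second-stage means is an assumption; the reported r values may have been computed at a different level of aggregation.

```r
library(dplyr)

# Hypothetical flight-level data: several flights share one origin-hour weather record
toy <- tibble(
  origin = c("EWR", "EWR", "JFK", "EWR", "JFK"),
  time_hour = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-01 05:00:00", "2013-01-01 05:00:00",
      "2013-01-01 06:00:00", "2013-01-01 06:00:00"),
    tz = "UTC"
  ),
  dep_delay = c(10, 30, 40, 50, 70),
  humid     = c(50, 50, 50, 80, 90)
)

# Stage 1: one row per origin x hour, with the mean delay of its flights
by_hour <- toy %>%
  group_by(origin, time_hour) %>%
  summarise(
    mean_delay = mean(dep_delay, na.rm = TRUE),
    humid = first(humid),                        # weather is constant within origin x hour
    .groups = "drop"
  )

# Stage 2: average the origin-hour means at each observed humidity level
by_level <- by_hour %>%
  filter(!is.na(humid)) %>%
  group_by(humid) %>%
  summarise(mean_delay = mean(mean_delay), .groups = "drop")

# Pearson correlation between humidity level and mean delay
r <- cor(by_level$humid, by_level$mean_delay)

print(by_level)
print(r)
```

Averaging within origin x hour first gives each weather record equal weight, so busy hours with many flights do not dominate the second-stage means.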
Results: In the scatter plots and the correlation heat map, humidity shows the strongest positive association with mean departure delay (r ≈ +0.24), followed by dew point (r ≈ +0.19). Visibility is negatively related to delay (r ≈ −0.18), consistent with more delays under low-visibility conditions. Precipitation and wind gusts are also associated with higher delays, though with greater variability. As a sensitivity analysis, I repeated the plots separately by origin (JFK, LGA, EWR). The overall ranking was similar, although wind gusts and precipitation differed most across airports, suggesting airport-specific effects.
Conclusion: Descriptively, high humidity / high dew point and low visibility are most strongly linked to larger departure delays, with precipitation and gusts contributing in an airport-dependent way.
Limitations: Because multiple flights share the same hourly weather, correlations may overstate precision, and several weather variables are highly collinear (e.g., temp–dew point; wind speed–gusts). A mixed-effects model with random effects for origin and time would better adjust for clustering and quantify effect sizes.